Random forest Gini importance favors SNPs with large minor allele frequency
Authors: … Bermejo, Carolin Strobl
Abstract
The use of random forests is increasingly common in genetic association studies. The variable importance measure (VIM) that is automatically calculated as a by-product of the algorithm is often used to rank polymorphisms with respect to their association with the investigated phenotype. Here we investigate a characteristic of this methodology that may be considered an important pitfall, namely that common variants are systematically favored by the widely used Gini VIM. As a consequence, researchers may overlook rare variants that contribute to the missing heritability. The goal of the present paper is threefold: 1) to assess this effect quantitatively using simulation studies for different types of random forests (classical random forests and conditional inference forests, which employ unbiased variable selection criteria) as well as for different importance measures (Gini and permutation-based), 2) to explore the trees and to compare the behavior of random forests and the standard logistic regression model in order to understand the statistical mechanisms behind the preference for common variants, and 3) to summarize our results and previously investigated properties of random forest VIMs in the context of association studies and to make practical recommendations regarding the methodological choice. The code implementing our study is available from the companion website: http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/020_professuren/boulesteix/ginibias/
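The preference for common variants can be made concrete with a toy simulation. The sketch below is an illustration only and not the authors' simulation design: it draws SNP genotypes (0/1/2) under an assumed Hardy-Weinberg model, draws a binary phenotype that is independent of the SNP, and records the best Gini impurity decrease a single split on that SNP achieves in one small node. The function names, the node size of 30, and the number of replicates are all assumptions chosen for illustration.

```python
import random


def gini(cases, n):
    """Gini impurity of a node with `cases` positives among `n` samples."""
    if n == 0:
        return 0.0
    q = cases / n
    return 2.0 * q * (1.0 - q)


def best_gini_decrease(genotypes, phenotype):
    """Largest impurity decrease over the two ordered splits of a 0/1/2 SNP."""
    n = len(genotypes)
    parent = gini(sum(phenotype), n)
    best = 0.0
    for cut in (0, 1):  # candidate split: genotype <= cut vs. genotype > cut
        left = [y for g, y in zip(genotypes, phenotype) if g <= cut]
        right = [y for g, y in zip(genotypes, phenotype) if g > cut]
        if not left or not right:
            continue  # this split is not available in the node
        children = (len(left) / n) * gini(sum(left), len(left)) \
            + (len(right) / n) * gini(sum(right), len(right))
        best = max(best, parent - children)
    return best


def mean_null_decrease(maf, n_samples=30, n_reps=2000, seed=1):
    """Average best Gini decrease for a SNP unrelated to the phenotype."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_reps):
        # Genotype = number of minor alleles in two independent draws
        # (Hardy-Weinberg equilibrium is an assumption of this sketch).
        genotypes = [(rng.random() < maf) + (rng.random() < maf)
                     for _ in range(n_samples)]
        # Null phenotype: case/control status independent of the genotype.
        phenotype = [int(rng.random() < 0.5) for _ in range(n_samples)]
        total += best_gini_decrease(genotypes, phenotype)
    return total / n_reps
```

Calling `mean_null_decrease(maf)` for several minor allele frequencies, for example 0.05, 0.1, 0.25 and 0.5, lets one compare the average impurity decrease that pure noise earns at each frequency. In this toy setting a rare SNP often has only one usable split, or none, in a small node, whereas a common SNP usually offers both. This is one plausible reason why the common SNP collects more spurious Gini decrease.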
Similar resources
Random forest Gini importance favours SNPs with large minor allele frequency: impact, sources and recommendations
The use of random forests is increasingly common in genetic association studies. The variable importance measure (VIM) that is automatically calculated as a by-product of the algorithm is often used to rank polymorphisms with respect to their ability to predict the investigated phenotype. Here, we investigate a characteristic of this methodology that may be considered as an important pitfall, n...
Carolin Strobl, Torsten Hothorn, Achim Zeileis: Party on! A New, Conditional Variable Importance Measure for Random Forests Available in the party Package
Danger: High Power! – Exploring the Statistical Properties of a Test for Random Forest Variable Importance
Random forests have become a widely-used predictive model in many scientific disciplines within the past few years. Additionally, they are increasingly popular for assessing variable importance, e.g., in genetics and bioinformatics. We highlight both advantages and limitations of different variable importance scores and associated testing procedures. For the test of Breiman and Cutler (2008), w...
Zeileis: Danger: High Power! – Exploring the Statistical Properties of a Test for Random Forest Variable
Random forests have become a widely-used predictive model in many scientific disciplines within the past few years. Additionally, they are increasingly popular for assessing variable importance, e.g., in genetics and bioinformatics. We highlight both advantages and limitations of different variable importance scores and associated testing procedures, especially in the context of correlated pred...
Party on! A New
Random forests are one of the most popular statistical learning algorithms, and a variety of methods for fitting random forests and related recursive partitioning approaches is available in R. This paper points out two important features of the random forest implementation cforest available in the party package: The resulting forests are unbiased and thus preferable to the randomForest implemen...
Publication date: 2011